Power Normalizing Second-order Similarity Network for Few-shot Learning
Second- and higher-order statistics of data points have played an important
role in advancing the state of the art on several computer vision problems such
as fine-grained image and scene recognition. However, these statistics need
to be passed via an appropriate pooling scheme to obtain the best performance.
Power Normalizations are non-linear activation units which enjoy
probability-inspired derivations and can be applied in CNNs. In this paper, we
propose a similarity learning network leveraging second-order information and
Power Normalizations. To this end, we propose several formulations capturing
second-order statistics and derive a sigmoid-like Power Normalizing function to
demonstrate its interpretability. Our model is trained end-to-end to learn the
similarity between the support set and query images for the problem of one- and
few-shot learning. The evaluations on Omniglot, miniImagenet and Open MIC
datasets demonstrate that this network obtains state-of-the-art results on
several few-shot learning protocols.
Zero-Shot Kernel Learning
In this paper, we address an open problem of zero-shot learning. Its
principle is based on learning a mapping that associates feature vectors
extracted from, e.g., images and attribute vectors that describe objects and/or
scenes of interest. In turn, this allows classifying unseen object classes
and/or scenes by matching feature vectors via mapping to a newly defined
attribute vector describing a new class. Due to the importance of such a learning
task, there exist many methods that learn semantic, probabilistic, linear or
piece-wise linear mappings. In contrast, we apply well-established kernel
methods to learn a non-linear mapping between the feature and attribute spaces.
We propose an easy learning objective inspired by the Linear Discriminant
Analysis, Kernel-Target Alignment and Kernel Polarization methods that promotes
incoherence. We evaluate the performance of our algorithm on the Polynomial as
well as shift-invariant Gaussian and Cauchy kernels. Despite the simplicity of our
approach, we obtain state-of-the-art results on several zero-shot learning
datasets and benchmarks including a recent AWA2 dataset.
Comment: IEEE Conference on Computer Vision and Pattern Recognition 201
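The three kernel families named in this abstract can be sketched as follows. This is only an illustration using textbook parametrizations; the abstract does not give the exact forms or hyperparameters used in the paper, so the function names and the parameters `degree`, `c` and `sigma` are assumptions.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, c=1.0):
    """Textbook polynomial kernel (<x, y> + c)^degree (assumed form)."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; depends only on x - y, so it is shift-invariant."""
    return math.exp(-sq_dist(x, y) / (2.0 * sigma ** 2))


def cauchy_kernel(x, y, sigma=1.0):
    """Cauchy kernel; also shift-invariant, with heavier tails than the Gaussian."""
    return 1.0 / (1.0 + sq_dist(x, y) / sigma ** 2)
```

Translating both inputs by the same offset leaves the Gaussian and Cauchy values unchanged, which is what "shift-invariant" refers to; the polynomial kernel does not have this property.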
CNN-based Action Recognition and Supervised Domain Adaptation on 3D Body Skeletons via Kernel Feature Maps
Deep learning is ubiquitous across many areas of computer vision. It
often requires large scale datasets for training before being fine-tuned on
small-to-medium scale problems. Activity, or, in other words, action
recognition, is one of many application areas of deep learning. While there
exist many Convolutional Neural Network architectures that work with the RGB
and optical flow frames, training on the time sequences of 3D body skeleton
joints is often performed via recurrent networks such as LSTM.
In this paper, we propose a new representation which encodes sequences of 3D
body skeleton joints in texture-like representations derived from
mathematically rigorous kernel methods. Such a representation becomes the first
layer in a standard CNN, e.g., ResNet-50, which is then used in the
supervised domain adaptation pipeline to transfer information from the source
to target dataset. This lets us leverage the available Kinect-based data beyond
training on a single dataset and outperform simple fine-tuning on any two
datasets combined in a naive manner. More specifically, in this paper we
utilize the overlapping classes between datasets. We associate datapoints of
the same class via so-called commonality, known from the supervised domain
adaptation. We demonstrate state-of-the-art results on three publicly available
benchmarks.
Dictionary Learning and Sparse Coding for Third-order Super-symmetric Tensors
Super-symmetric tensors - a higher-order extension of scatter matrices - are
becoming increasingly popular in machine learning and computer vision for
modelling data statistics, co-occurrences, or even as visual descriptors.
However, the size of these tensors is exponential in the data dimensionality,
which is a significant concern. In this paper, we study third-order
super-symmetric tensor descriptors in the context of dictionary learning and
sparse coding. Our goal is to approximate these tensors as sparse conic
combinations of atoms from a learned dictionary, where each atom is a symmetric
positive semi-definite matrix. Apart from the significant benefits to tensor
compression that this framework provides, our experiments demonstrate that the
sparse coefficients produced by the scheme lead to better aggregation of
high-dimensional data, and showcase superior performance on two common
computer vision tasks compared to the state-of-the-art.
Comment: 13 pages, NIP
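To make the object of study concrete, the sketch below builds a third-order tensor as the average of the outer products x ⊗ x ⊗ x over a set of vectors and checks that it is invariant to any permutation of its indices. It illustrates only the descriptor, not the dictionary learning or sparse coding scheme of the paper; the function names are hypothetical. For d-dimensional inputs the tensor has d^3 entries, which is the storage concern the abstract raises.

```python
from itertools import permutations


def third_order_tensor(points):
    """Average of the outer products x (x) x (x) x over all input vectors."""
    n = len(points)
    d = len(points[0])
    t = [[[0.0] * d for _ in range(d)] for _ in range(d)]
    for x in points:
        for i in range(d):
            for j in range(d):
                for k in range(d):
                    t[i][j][k] += x[i] * x[j] * x[k] / n
    return t


def is_super_symmetric(t, tol=1e-12):
    """True if every entry is unchanged under all permutations of (i, j, k)."""
    d = len(t)
    for i in range(d):
        for j in range(d):
            for k in range(d):
                for a, b, c in permutations((i, j, k)):
                    if abs(t[i][j][k] - t[a][b][c]) > tol:
                        return False
    return True
```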
Model Selection for Generalized Zero-shot Learning
In the problem of generalized zero-shot learning, the datapoints from unknown
classes are not available during training. The main challenge for generalized
zero-shot learning is the unbalanced data distribution which makes it hard for
the classifier to distinguish if a given testing sample comes from a seen or
unseen class. However, using a Generative Adversarial Network (GAN) to generate
auxiliary datapoints from the semantic embeddings of unseen classes alleviates
the above problem. Current approaches combine the auxiliary datapoints and
original training data to train the generalized zero-shot learning model and
obtain state-of-the-art results. Inspired by such models, we propose to feed
the generated data via a model selection mechanism. Specifically, we leverage
two sources of datapoints (observed and auxiliary) to train some classifier to
recognize which test datapoints come from seen and which from unseen classes.
This way, generalized zero-shot learning can be divided into two disjoint
classification tasks, thus reducing the negative influence of the unbalanced
data distribution. Our evaluations on four publicly available datasets for
generalized zero-shot learning show that our model obtains state-of-the-art
results.
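The routing idea described in this abstract (first decide whether a test point is seen or unseen, then classify it within that group) can be sketched as below. Nearest-centroid classifiers stand in for whatever models the paper actually trains, and the toy data, labels and function names are all hypothetical; the `auxiliary` points play the role of the GAN-generated samples.

```python
def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def fit_centroids(labelled):
    """Map each label to the mean of its datapoints (stand-in classifier)."""
    return {label: mean(pts) for label, pts in labelled.items()}


def nearest(x, centroids):
    """Label of the centroid closest to x in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


def predict(x, gate, seen_clf, unseen_clf):
    """Route x through the seen/unseen gate, then classify within that group."""
    domain = nearest(x, gate)
    expert = seen_clf if domain == "seen" else unseen_clf
    return domain, nearest(x, expert)


# Toy data: observed points for seen classes, generated points for unseen ones.
observed = {"cat": [[0.0, 0.0], [0.2, 0.1]], "dog": [[1.0, 0.0], [1.1, 0.2]]}
auxiliary = {"zebra": [[5.0, 5.0], [5.2, 4.9]], "okapi": [[6.0, 4.0], [6.1, 4.2]]}

seen_clf = fit_centroids(observed)
unseen_clf = fit_centroids(auxiliary)
gate = {
    "seen": mean([p for pts in observed.values() for p in pts]),
    "unseen": mean([p for pts in auxiliary.values() for p in pts]),
}
```

Because each expert only ever competes against classes from its own group, the imbalance between real and generated data affects the gate alone rather than every class decision.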
Artwork Identification from Wearable Camera Images for Enhancing Experience of Museum Audiences
Recommendation systems based on image recognition could prove a vital tool in
enhancing the experience of museum audiences. However, for practical systems
utilizing wearable cameras, a number of challenges exist which affect the
quality of image recognition. In this pilot study, we focus on recognition of
museum collections by using a wearable camera in three different museum spaces.
We discuss the application of wearable cameras, and the practical and technical
challenges in devising a robust system that can recognize artworks viewed by
the visitors to create a detailed record of their visit. Specifically, to
illustrate the impact of different kinds of museum spaces on image recognition,
we collect three training datasets of museum exhibits containing a variety of
paintings, clocks, and sculptures. Subsequently, we equip selected visitors
with wearable cameras to capture artworks viewed by them as they stroll along
exhibitions. We use Convolutional Neural Networks (CNN) which are pre-trained
on the ImageNet dataset and fine-tuned on each of the training sets for the
purpose of artwork identification. In the testing stage, we use CNNs to
identify artworks captured by the visitors with a wearable camera. We analyze
the accuracy of their recognition and provide an insight into the applicability
of such a system to further engage audiences with museum exhibitions.
Fisher-Bures Adversary Graph Convolutional Networks
In a graph convolutional network, we assume that the graph is generated
with respect to some observation noise. During learning, we make small random
perturbations of the graph and try to improve generalization. Based on quantum
information geometry, the perturbation can be characterized by the
eigendecomposition of the graph Laplacian matrix. We try to minimize the loss
with respect to the perturbed graph while making the perturbation effective in
terms of the Fisher information of the neural network. Our proposed model can
consistently improve graph convolutional networks on semi-supervised node
classification tasks with reasonable computational overhead. We present three
different geometries on the manifold of graphs: the intrinsic geometry measures
the information theoretic dynamics of a graph; the extrinsic geometry
characterizes how such dynamics can affect externally a graph neural network;
the embedding geometry is for measuring node embeddings. These new analytical
tools are useful in developing a good understanding of graph neural networks
and fostering new techniques.
Comment: Published in UAI 201
Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons (Extended Version)
In this paper, we explore tensor representations that can compactly capture
higher-order relationships between skeleton joints for 3D action recognition.
We first define RBF kernels on 3D joint sequences, which are then linearized to
form kernel descriptors. The higher-order outer-products of these kernel
descriptors form our tensor representations. We present two different kernels
for action recognition, namely (i) a sequence compatibility kernel that
captures the spatio-temporal compatibility of joints in one sequence against
those in the other, and (ii) a dynamics compatibility kernel that explicitly
models the action dynamics of a sequence. Tensors formed from these kernels are
then used to train an SVM. We present experiments on several benchmark datasets
and demonstrate state-of-the-art results, substantiating the effectiveness of
our representations.
A Deeper Look at Power Normalizations
Power Normalizations (PN) are very useful non-linear operators in the context
of Bag-of-Words data representations as they tackle problems such as feature
imbalance. In this paper, we reconsider these operators in the deep learning
setup by introducing a novel layer that implements PN for non-linear pooling of
feature maps. Specifically, by using a kernel formulation, our layer combines
the feature vectors and their respective spatial locations in the feature maps
produced by the last convolutional layer of a CNN. Linearization of such a kernel
results in a positive definite matrix capturing the second-order statistics of
the feature vectors, to which PN operators are applied. We study two types of
PN functions, namely (i) MaxExp and (ii) Gamma, addressing their role and
meaning in the context of nonlinear pooling. We also provide a probabilistic
interpretation of these operators and derive their surrogates with well-behaved
gradients for end-to-end CNN learning. We apply our theory to practice by
implementing the PN layer on a ResNet-50 model and showcase experiments on four
benchmarks for fine-grained recognition, scene recognition, and material
classification. Our results demonstrate state-of-the-art performance across all
these tasks.
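As a rough illustration of the pipeline in this abstract, the sketch below forms a second-order (autocorrelation) matrix from a set of feature vectors and applies an element-wise Power Normalization to it. The abstract names the MaxExp and Gamma operators but does not define them; the forms used here, p^gamma and 1 - (1 - p)^eta for p in [0, 1], are the commonly cited ones and are an assumption, as are the function names and default parameters. The spatial-location encoding and the gradient-friendly surrogates mentioned in the abstract are not covered.

```python
def autocorrelation(features):
    """Second-order statistics: the average of f f^T over all feature vectors."""
    n = len(features)
    d = len(features[0])
    return [
        [sum(f[i] * f[j] for f in features) / n for j in range(d)]
        for i in range(d)
    ]


def gamma_pn(p, gamma=0.5):
    """Gamma operator (assumed form): p ** gamma for p >= 0."""
    return p ** gamma


def maxexp_pn(p, eta=3):
    """MaxExp operator (assumed form): 1 - (1 - p) ** eta for p in [0, 1]."""
    return 1.0 - (1.0 - p) ** eta


def apply_elementwise(matrix, fn):
    """Apply a scalar Power Normalization to every entry of a matrix."""
    return [[fn(v) for v in row] for row in matrix]
```

Both operators leave 0 and 1 fixed and boost small positive values, which is how they damp the imbalance between frequently and rarely co-occurring features.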
6DoF Object Pose Estimation via Differentiable Proxy Voting Loss
Estimating a 6DoF object pose from a single image is very challenging due to
occlusions or textureless appearances. Vector-field based keypoint voting has
demonstrated its effectiveness and superiority in tackling those issues.
However, direct regression of vector-fields neglects that the distances between
pixels and keypoints also affect the deviations of hypotheses dramatically. In
other words, small errors in direction vectors may generate severely deviated
hypotheses when pixels are far away from a keypoint. In this paper, we aim to
reduce such errors by incorporating the distances between pixels and keypoints
into our objective. To this end, we develop a simple yet effective
differentiable proxy voting loss (DPVL) which mimics the hypothesis selection
in the voting procedure. By exploiting our voting loss, we are able to train
our network in an end-to-end manner. Experiments on widely used datasets, i.e.,
LINEMOD and Occlusion LINEMOD, demonstrate that our DPVL improves pose estimation
performance significantly and speeds up the training convergence.
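The geometric point made in this abstract, that the same angular error in a direction vector produces a larger deviation the farther the pixel is from the keypoint, can be checked with a short 2D sketch. It measures the perpendicular distance from a keypoint to the line cast from a pixel along its predicted direction. Wrapping that distance in a smooth L1 penalty is an assumption made for illustration; the abstract does not specify the form of the loss, and the function names are hypothetical.

```python
import math


def keypoint_to_line_distance(keypoint, pixel, direction):
    """Perpendicular distance from keypoint to the line through pixel along direction."""
    dx, dy = direction
    rx, ry = keypoint[0] - pixel[0], keypoint[1] - pixel[1]
    # Magnitude of the 2D cross product, normalised by the direction length.
    return abs(rx * dy - ry * dx) / math.hypot(dx, dy)


def smooth_l1(x, beta=1.0):
    """Smooth L1 penalty (assumed choice): quadratic near zero, linear beyond beta."""
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta


def proxy_voting_penalty(keypoint, pixel, direction):
    """Distance-aware penalty for a single pixel's predicted direction."""
    return smooth_l1(keypoint_to_line_distance(keypoint, pixel, direction))


# Same angular error (0.1 rad) at two pixels, 10 and 100 units from the keypoint.
theta = 0.1
off_direction = (-math.cos(theta), math.sin(theta))
near = keypoint_to_line_distance((0.0, 0.0), (10.0, 0.0), off_direction)
far = keypoint_to_line_distance((0.0, 0.0), (100.0, 0.0), off_direction)
```

Here `far` is ten times `near`, so a loss on the direction vector alone would treat the two pixels identically, while a distance-based penalty does not.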